{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Oversubscribing GPU memory in cuGraph\n",
    "#### Author : Alex Fender\n",
    "\n",
    "In this notebook, we will show how to **scale to 4x larger graphs than before** without performance drop using managed memory features in cuGraph. We will compute the PageRank of each user in Twitter's dataset on a single GPU as an example. This technique applies to all features.\n",
    "\n",
    "Unified Memory is a single memory address space accessible from any processor in a system. If a kernel tries to access any absent pages,the Page Migration Engine migrates the pages. When the GPU memory is full, least recently used pages are evicted. In other words, Unified Memory transparently enables oversubscribing GPU memory, enabling out-of-core computations.\n",
    "\n",
    "\n",
    "This notebook was tested on an NVIDIA 48GB RTX8000 GPU using RAPIDS 0.14 and CUDA 10.2. Please be aware that your system may be different, and you may need to modify the code or install packages to run the below examples. If you think you have found a bug or an error, please file an issue in [cuGraph](https://github.com/rapidsai/cugraph/issues)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "We will be analyzing **1.47 billion social relations** on 41.7 million user profiles from the Twitter dataset.  The CSV file is 26GB and was collected in :<br>\n",
    "*What is Twitter, a social network or a news media? Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010.*<br> \n",
    "\n",
    "Notice that the memory requirement to read this 26GB dataset is already bigger than the memory of a single GPU. While we are not limited by the device memory size in this case, the whole system (host+device memory) should still have at least 80GB of memory available"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PageRank with cuGraph\n",
    "### Basic setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import needed libraries. We recommend using cugraph_dev env through conda\n",
    "import time\n",
    "import rmm \n",
    "import cudf \n",
    "import cugraph "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Twitter dataset is in our S3 bucket and zipped.  \n",
    "1. We'll need to create a folder for our data in the `/data` folder\n",
    "1. Download the zipped data into that folder from S3 (it will take some time as it it 6GB)\n",
    "1. Decompress the zipped data for use (it will take some time as it it 26GB)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "import urllib.request\n",
    "import os\n",
    "\n",
    "data_dir = '../data/'\n",
    "if not os.path.exists(data_dir):\n",
    "    print('creating data directory')\n",
    "    os.system('mkdir ../data')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download the Twitter dataset\n",
    "base_url = 'https://s3.us-east-2.amazonaws.com/rapidsai-data/cugraph/benchmark/'\n",
    "fn = 'twitter-2010.csv'\n",
    "comp = '.gz'\n",
    "if not os.path.isfile(data_dir+fn):\n",
    "    if not os.path.isfile(data_dir+fn+comp):\n",
    "        print(f'Downloading {base_url+fn+comp} to {data_dir+fn+comp}')\n",
    "        urllib.request.urlretrieve(base_url+fn+comp, data_dir+fn+comp)\n",
    "    print(f'Decompressing {data_dir+fn+comp}...')\n",
    "    os.system('gunzip '+data_dir+fn+comp)\n",
    "    print(f'{data_dir+fn+comp} decompressed!')\n",
    "else:\n",
    "    print(f'Your data file, {data_dir+fn}, already exists')\n",
    "\n",
    "# File path, assuming Notebook directory\n",
    "input_data_path = data_dir+fn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize RMM\n",
    "RAPIDS Memory Manager (RMM) is a  central place for all device memory allocations in RAPIDS libraries. Using RMM in python code is straightforward. Simply `import rmm` and configure RMM options to be used before making any call to a RAPIDS library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Set RMM to allocate all memory as managed memory (cudaMallocManaged underlying allocator)\n",
    "rmm.reinitialize(managed_memory=True)\n",
    "assert(rmm.is_initialized())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read the data from disk\n",
    "cuGraph depends on cudf for data loading and the initial DataFrame creation. The CSV data file contains an edge list, which represents the connection of a vertex to another. The source to destination pairs is what is known as Coordinate Format (COO). In this test case, the data is just two columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Start timer\n",
    "t_start = time.time()\n",
    "\n",
    "# CSV reader\n",
    "e_list = cudf.read_csv(input_data_path, delimiter=' ', names=['src', 'dst'], dtype=['int32', 'int32'])\n",
    "\n",
    "# Print time\n",
    "print(\"Reader: \", time.time()-t_start, \"s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a graph\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t_start = time.time()\n",
    "\n",
    "# Create a directed graph using the source (src) and destination (dst) vertex pairs from the Dataframe \n",
    "G = cugraph.DiGraph()\n",
    "G.from_cudf_edgelist(e_list, source='src', destination='dst', renumber=False)\n",
    "\n",
    "# (optional) request the transposed here so that we can analyse pagerank solver time alone\n",
    "G.view_transposed_adj_list()\n",
    "\n",
    "# Print time\n",
    "print(\"Load and transpose: \", time.time()-t_start, \"s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Call PageRank algorithm\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Start timer\n",
    "t_start = time.time()\n",
    "\n",
    "# Get the pagerank scores\n",
    "pr_df = cugraph.pagerank(G, tol=1e-4)\n",
    "\n",
    "# Print time\n",
    "print(\"Pagerank: \", time.time()-t_start, \"s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It was that easy! PageRank should only take a few seconds to run on this 26GB input with one GPU.<br>\n",
    "Check out how it compares to published Spark results in the [Annex](#annex_cell)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Further analysis on the PageRank result\n",
    "\n",
    "We can now identify the most influent users in the network.<br>\n",
    "Notice that the PageRank result is already in a regular `cudf.DataFrame`. We can then sort by PageRank value and print the *Top 3*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Start timer\n",
    "t_start = time.time()\n",
    "\n",
    "# Sort, descending order\n",
    "pr_sorted_df = pr_df.sort_values('pagerank',ascending=False)\n",
    "\n",
    "# Print time\n",
    "print(time.time()-t_start, \"s\")\n",
    "\n",
    "# Print the Top 3\n",
    "print(pr_sorted_df.head(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use the [map](https://s3.us-east-2.amazonaws.com/rapidsai-data/cugraph/benchmark/twitter-2010-ids.csv.gz) to convert Vertex ID into to Twitter's numeric ID. The user name can also be retrieved using the [TwitterID](https://tweeterid.com/) web app.<br>\n",
    "The table below shows more information on our *Top 3*. Notice that this ranking is much better at capturing network influence compared the number of followers for instance. Further analysis of this dataset was published [here](https://doi.org/10.1145/1772690.1772751).\n",
    "\n",
    "| Vertex ID\t| Twitter ID\t| User name\t| Description |\n",
    "| --------- |  ---------   | --------   |   ----------  |\n",
    "| 21513299\t| 813286\t| barackobama\t| US President (2009-2017) |\n",
    "| 23933989\t| 14224719\t| 10DowningStreet | UK Prime Minister office |\n",
    "| 23933986\t| 15131310\t| WholeFoods\t| Food store from Austin |\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Annex\n",
    "<a id='annex_cell'></a>\n",
    "An experiment comparing various porducts for this workflow was published in *GraphX: Graph Processing in a Distributed Dataflow Framework,OSDI, 2014*. They used 16 m2.4xlarge worker nodes on Amazon EC2. There was a total of 128 CPU cores and 1TB of memory in this 2014 setup.\n",
    "\n",
    "![twitter-2010-spark.png](twitter-2010-spark.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "Copyright (c) 2020, NVIDIA CORPORATION.\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n",
    "___"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
